Question 1

a). Explore with appropriate graphical tools the correlation between the variables

Step 1: Load Data and Subset Variables

# Load required packages (assumed installed; used throughout this document)
library(fs)         # path()
library(ggplot2)    # boxplots
library(heatmaply)  # interactive correlation heatmaps
library(MASS)       # stepAIC()
library(glmnet)     # ridge / LASSO regression

# Load data
load(path(data_folder, "Residen.RData"))
Data <- Residen

# Extract variables by category
time_vars <- colnames(Data)[1:4]    # Time variables (START YEAR, START QUARTER, etc.)
project_vars <- paste0("V", 1:8)    # Project physical/financial variables (V1-V8)
economic_lag_vars <- list(          # Define economic lag groups (Lag1-Lag5)
  Lag1 = paste0("V", 9:27),  # V9-V27
  Lag2 = paste0("V", 28:46),  # V28-V46
  Lag3 = paste0("V", 47:65),  # V47-V65
  Lag4 = paste0("V", 66:84),  # V66-V84
  Lag5 = paste0("V", 85:103))  # V85-V103     
output_vars <- c("V104", "V105")     # Output variables (sales price and construction costs)


#  Define a Reusable Function for Correlation Heatmaps

plot_cor_heatmap <- function(data, vars, title = "Correlation Heatmap",
                             k_col = 2, k_row = 2,
                             fontsize_row = 8, fontsize_col = 8,
                             width = 800, height = 600) {
  # Define color gradient: red-white-blue
  custom_palette <- colorRampPalette(c(
    "#FF0000",  # Red (for -1)
    "#FFFFFF",  # White (for 0)
    "#00008B"   # Dark blue (for +1)
  ))(50)
  
  # Extract subset of variables
  selected_vars <- data[, vars, drop = FALSE]
  
  # Calculate correlation matrix
  cor_matrix <- cor(selected_vars, use = "complete.obs")
  
  # Plot interactive heatmap
  heatmaply(
    cor_matrix,
    colors = custom_palette,          # Apply custom color scale
    limits = c(-1, 1),                # Force color range to [-1, 1]
    k_col = k_col,
    k_row = k_row,
    fontsize_row = fontsize_row,
    fontsize_col = fontsize_col,
    main = title,
    show_dendrogram = c(TRUE, TRUE),  # Show row/column dendrograms
    width = width,
    height = height
  )
}

Step 2: Visualise Correlations by Category

1. Time Variables vs. Outputs

# Analyze Sales Price (V104) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V104)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sales Price by Starting Year", x = "Year", y = "Sales Price (V104)")

# Analyze Sales Price (V104) by starting quarter  
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V104)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Sales Price by Starting Quarter", x = "Quarter", y = "Sales Price (V104)")

# Analyze Construction Costs (V105) by starting year
ggplot(Data, aes(x = as.factor(`START YEAR`), y = V105)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Construction Costs by Starting Year", x = "Year", y = "Construction Costs (V105)")

# Analyze Construction Costs (V105) by starting quarter
ggplot(Data, aes(x = as.factor(`START QUARTER`), y = V105)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Construction Costs by Starting Quarter", x = "Quarter", y = "Construction Costs (V105)")

cor_time_output <- plot_cor_heatmap(
  data = Data, 
  vars = c(time_vars, output_vars),
  title = "Correlation: Time Variables vs. Outputs")

cor_time_output

Time Variables vs. Output Variables – Observations

From the heatmap and boxplots, several patterns emerge:

  • Temporal Alignment: project start year (V100) and completion year (V101) are almost perfectly correlated (r = 0.988). Construction durations are relatively stable across projects, so scheduling norms leave start and completion years nearly linearly aligned.
  • Correlations with Outputs:
    • Start year: r = 0.607 with Actual Sales Price (V104), r = 0.753 with Actual Construction Cost (V105).
    • Completion year: r = 0.613 with V104, r = 0.769 with V105.
  • Both timing variables are therefore positively correlated with construction costs and, to a slightly lesser degree, with sales prices: later projects tend to be more expensive to build and are priced accordingly, reflecting changing economic conditions over time.
  • Output Relationships: Actual Sales Price (V104) and Construction Cost (V105) are strongly correlated (r = 0.796). Higher construction costs are typically accompanied by higher sales prices, likely due to budget scaling and market positioning.
  • Temporal Trends: boxplots show a general upward trend in both sales prices and construction costs across years, consistent with inflationary effects and rising input costs over time.
  • Seasonal Effects:
    • Sales prices remain relatively stable across quarters, though Quarter 1 shows a slightly higher median.
    • Construction costs are highest in Quarters 1 and 4, possibly due to seasonal budgeting cycles or weather-related cost factors.
  • These patterns indicate that seasonal timing may affect input costs more than sale prices.
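The specific coefficients quoted above can be spot-checked numerically; a minimal sketch using the column names already introduced in this section:

```r
# Spot-check the quoted pairwise correlations directly from the data
round(cor(Data[, c("START YEAR", "COMPLETION YEAR", "V104", "V105")],
          use = "complete.obs"), 3)
```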

2. Project Variables vs. Outputs

cor_project_output <- plot_cor_heatmap(
  data = Data, 
  vars = c(project_vars, output_vars),
  title = "Correlation: Project Variables vs. Outputs")

cor_project_output

Project Variables vs. Output Variables – Observations

Based on the correlation heatmap, we observe the following relationships:

  • Predictors of Actual Sales Price (V104):
    • V8: Initial unit price (r = 0.976), and
    • V5: Preliminary estimated construction cost (r = 0.785).
  • The correlation between V104 and V8 is extremely high, suggesting that the initial unit price is an excellent predictor of the final sales price. V5 is also strongly correlated, but less so than V8, indicating that early pricing information may outperform early cost estimates in predicting actual sales outcomes.
  • Predictors of Actual Construction Cost (V105):
    • V4: Total preliminary estimated construction cost (r = 0.602)
    • V5: Preliminary estimated construction cost (r = 0.963)
    • V8: Initial unit price (r = 0.790)
  • These values indicate that V5 is a considerably more reliable predictor of V105 than V4, and that V8 is not only predictive of sales prices but also strongly linked to actual construction costs.

3. Economic Variables (Time Lag Analysis)

# Step 1: Save all heatmaply objects into a list
heatmap_list <- lapply(names(economic_lag_vars), function(lag_name) {
  current_vars <- c(economic_lag_vars[[lag_name]], output_vars)
  
  plot_cor_heatmap(
    data = Data,
    vars = current_vars,
    title = paste("Economic Variables:", lag_name),
    k_col = 3,
    k_row = 3
  )
})

# Step 2: Show them one by one 
heatmap_list[[1]]
heatmap_list[[2]]
heatmap_list[[3]]  
heatmap_list[[4]]  
heatmap_list[[5]]  

Economic Variables vs. Output Variables – Observations

Based on the five heatmap visualisations of economic variables at different time lags (Lag1 through Lag5), several patterns emerge in the correlations between the economic indicators and the project outputs:

A long-term negative relationship with the outputs appears for the variables representing “the number of loans extended by banks”:

  • V18: time lag 1,
  • V37: time lag 2,
  • V56: time lag 3,
  • V75: time lag 4, and
  • V94: time lag 5.

This indicates a strong inverse relationship between banking loan volume and downstream economic activity. The consistency across all five lags suggests that tighter lending conditions constrain real estate market activity, including building permits, floor area expansions, and sales prices. Such correlations align with macroeconomic theory: as borrowing costs increase, construction and investment activity tend to decelerate, hindering economic growth.
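As a quick numerical check of this cross-lag consistency, the correlation of each lagged loans variable with both outputs can be tabulated directly; a sketch, assuming the lag columns are as listed above:

```r
# Correlation of the lagged 'loans' variables with both outputs
loan_lags <- c(Lag1 = "V18", Lag2 = "V37", Lag3 = "V56", Lag4 = "V75", Lag5 = "V94")
sapply(loan_lags, function(v)
  c(SalesPrice = cor(Data[[v]], Data$V104, use = "complete.obs"),
    ConstrCost = cor(Data[[v]], Data$V105, use = "complete.obs")))
```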

  • Economic Indicator Cluster:
    • V10: Building services index
    • V11: Wholesale price index of building materials
    • V13: Cumulative liquidity
    • V14: Private sector investment in new buildings
    • V15: Land price index
    • V19: Average construction cost at completion
    • V20: Average construction cost at beginning
    • V23: Consumer price index
    • V24: CPI of housing, water, fuel & power
    • V27: Gold price per ounce
  • These variables demonstrate extraordinarily high internal correlation across all five time periods, with values ranging from 0.88 to 0.999 — approaching perfect alignment.
  • Relationships with Project Outputs:
    • Each of these indicators shows consistently strong positive correlations with V104 (Actual Sales Price) and V105 (Actual Construction Cost).
  • This indicates that economic fundamentals drive both cost structures and pricing outcomes in residential construction.

  • Cross-Lag Stability:
    • Correlations remain strong across all five time lags, unaffected by temporal shifts.
  • This suggests these indicators represent stable, long-term drivers of market behavior — providing predictive utility for both cost planning and pricing strategy.

The dendrogram and accompanying correlation heatmap reveal two distinct clusters of closely related variables:

  • Building Activity Indicators:
    • V9: Number of building permits issued, and
    • V12: Total floor area of building permits issued.

    These variables cluster tightly, reflecting a strong positive correlation, and are visually confirmed by a deep blue hue in the heatmap. This relationship is intuitive, as a higher number of building permits typically entails a greater cumulative floor area.

  • Economic and Demographic Drivers:
    • V16: Number of loans extended by banks,
    • V26: Population of the city

    These variables also appear in close proximity, with similarly strong positive correlation (deep blue), suggesting that more populous cities tend to receive more bank loans—likely due to greater economic activity and housing demand.

In contrast, both pairs exhibit weaker correlations with other variables, as indicated by lighter blue shades elsewhere in the heatmap. This clustering structure underscores two coherent real-world mechanisms:

  • Permitting volume and size are administratively and physically linked,
  • Loan issuance scales naturally with population-driven economic dynamics.

b). Please fit a linear regression model to explain the “actual sales price” (V104) in terms of the other variables, excluding “actual construction costs” (V105).

lr_model <- lm(V104 ~ . - V105, data = Data)
#summary(lr_model)
html_summary(lr_model)

Model Summary


Call:
lm(formula = V104 ~ . - V105, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-901.15  -52.29   -1.50   45.71  645.31 

Coefficients: (32 not defined because of singularities)
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.004e+05  1.999e+05   0.502 0.615781    
`START YEAR`         -1.614e+03  3.388e+03  -0.476 0.634175    
`START QUARTER`      -4.928e+02  2.457e+03  -0.201 0.841139    
`COMPLETION YEAR`     1.521e+02  1.689e+01   9.007  < 2e-16 ***
`COMPLETION QUARTER`  5.984e+01  8.881e+00   6.738 8.36e-11 ***
V1                   -4.779e+00  2.225e+00  -2.148 0.032529 *  
V2                    6.630e-02  2.195e-02   3.020 0.002746 ** 
V3                   -2.331e-01  6.451e-02  -3.612 0.000356 ***
V4                    6.442e-03  3.570e-02   0.180 0.856905    
V5                   -6.548e-01  3.342e-01  -1.960 0.050975 .  
V6                    8.764e-02  6.322e-02   1.386 0.166727    
V7                           NA         NA      NA       NA    
V8                    1.203e+00  1.681e-02  71.579  < 2e-16 ***
V9                    2.016e-01  1.103e+00   0.183 0.855159    
V10                   1.373e+02  9.675e+02   0.142 0.887245    
V11                   1.048e+01  7.426e+01   0.141 0.887830    
V12                  -1.626e+02  6.837e+02  -0.238 0.812200    
V13                   1.329e-04  1.125e-01   0.001 0.999058    
V14                   1.799e-01  4.339e-01   0.415 0.678643    
V15                  -2.265e+01  5.008e+01  -0.452 0.651433    
V16                   8.696e-01  2.753e+00   0.316 0.752315    
V17                  -4.247e-03  1.860e-01  -0.023 0.981801    
V18                   2.328e+02  1.235e+03   0.188 0.850672    
V19                  -2.007e-01  4.647e+00  -0.043 0.965587    
V20                  -3.910e-01  1.187e+00  -0.329 0.742147    
V21                   6.422e-02  3.565e-01   0.180 0.857174    
V22                   3.682e-03  4.524e-01   0.008 0.993511    
V23                   6.413e+00  6.128e+02   0.010 0.991658    
V24                   4.270e+01  2.347e+02   0.182 0.855747    
V25                   5.267e-02  1.033e-01   0.510 0.610369    
V26                  -4.675e-04  4.268e-01  -0.001 0.999127    
V27                   9.455e-04  9.654e-03   0.098 0.922046    
V28                   5.915e-02  1.795e-01   0.330 0.741930    
V29                  -1.144e+02  8.626e+02  -0.133 0.894615    
V30                  -7.247e+00  1.452e+02  -0.050 0.960240    
V31                  -9.085e+01  3.036e+02  -0.299 0.764953    
V32                  -1.780e-03  8.287e-02  -0.021 0.982879    
V33                  -5.666e-02  3.707e-01  -0.153 0.878630    
V34                  -9.607e+00  1.077e+02  -0.089 0.929008    
V35                   9.323e-01  1.976e+00   0.472 0.637449    
V36                   1.490e-02  8.094e-02   0.184 0.854034    
V37                   7.421e+01  6.343e+02   0.117 0.906942    
V38                  -4.162e-01  6.246e+00  -0.067 0.946913    
V39                   6.458e-01  2.013e+00   0.321 0.748563    
V40                  -8.375e-02  1.877e-01  -0.446 0.655784    
V41                   1.615e-01  5.671e-01   0.285 0.776052    
V42                   1.890e+02  6.380e+02   0.296 0.767239    
V43                  -1.891e+02  8.982e+02  -0.211 0.833373    
V44                  -1.202e-02  5.531e-01  -0.022 0.982673    
V45                  -1.184e-02  3.554e-01  -0.033 0.973441    
V46                  -1.038e-03  8.216e-03  -0.126 0.899575    
V47                   1.040e-01  2.590e-01   0.402 0.688186    
V48                   1.119e+02  1.334e+03   0.084 0.933224    
V49                  -3.006e+01  4.980e+01  -0.604 0.546571    
V50                  -1.746e+02  4.440e+02  -0.393 0.694395    
V51                   6.526e-03  1.409e-01   0.046 0.963102    
V52                  -7.985e-02  4.933e-01  -0.162 0.871519    
V53                   6.439e+00  1.296e+02   0.050 0.960422    
V54                   2.405e+00  4.657e+00   0.516 0.605918    
V55                  -1.698e-02  1.631e-01  -0.104 0.917133    
V56                   3.089e+01  2.974e+02   0.104 0.917333    
V57                  -2.722e-01  2.233e+00  -0.122 0.903072    
V58                   3.539e-01  3.079e+00   0.115 0.908564    
V59                  -9.660e-02  5.540e-01  -0.174 0.861684    
V60                  -2.140e-01  1.307e+00  -0.164 0.870070    
V61                   4.198e+01  1.636e+02   0.257 0.797647    
V62                   2.040e+02  1.364e+03   0.150 0.881170    
V63                  -1.273e-01  8.479e-01  -0.150 0.880716    
V64                  -2.178e-02  3.772e-01  -0.058 0.953994    
V65                  -9.490e-04  8.072e-03  -0.118 0.906483    
V66                   6.260e-02  6.990e-01   0.090 0.928700    
V67                  -2.018e+02  7.758e+02  -0.260 0.794924    
V68                   6.947e+01  7.047e+01   0.986 0.324982    
V69                   1.248e+02  1.397e+02   0.894 0.372099    
V70                  -9.328e-03  4.437e-02  -0.210 0.833629    
V71                  -1.346e-01  4.248e-01  -0.317 0.751583    
V72                  -1.492e+01  2.247e+02  -0.066 0.947118    
V73                          NA         NA      NA       NA    
V74                          NA         NA      NA       NA    
V75                          NA         NA      NA       NA    
V76                          NA         NA      NA       NA    
V77                          NA         NA      NA       NA    
V78                          NA         NA      NA       NA    
V79                          NA         NA      NA       NA    
V80                          NA         NA      NA       NA    
V81                          NA         NA      NA       NA    
V82                          NA         NA      NA       NA    
V83                          NA         NA      NA       NA    
V84                          NA         NA      NA       NA    
V85                          NA         NA      NA       NA    
V86                          NA         NA      NA       NA    
V87                          NA         NA      NA       NA    
V88                          NA         NA      NA       NA    
V89                          NA         NA      NA       NA    
V90                          NA         NA      NA       NA    
V91                          NA         NA      NA       NA    
V92                          NA         NA      NA       NA    
V93                          NA         NA      NA       NA    
V94                          NA         NA      NA       NA    
V95                          NA         NA      NA       NA    
V96                          NA         NA      NA       NA    
V97                          NA         NA      NA       NA    
V98                          NA         NA      NA       NA    
V99                          NA         NA      NA       NA    
V100                         NA         NA      NA       NA    
V101                         NA         NA      NA       NA    
V102                         NA         NA      NA       NA    
V103                         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 148.6 on 296 degrees of freedom
Multiple R-squared:  0.9879,    Adjusted R-squared:  0.9848 
F-statistic: 321.9 on 75 and 296 DF,  p-value: < 2.2e-16

From the summary of the linear regression model, we can find that:

  • Residual Distribution:
    • Residuals are roughly symmetric around zero (median = -1.50), indicating unbiased predictions on average.
    • The wide residual range (from -901.15 to 645.31) suggests the model struggles with extreme values.
  • This pattern reflects good central performance but potential issues with outliers.
  • Coefficient Significance:
    • COMPLETION YEAR (β = 152.1, p < 0.001) and COMPLETION QUARTER (β = 59.84, p < 0.001) are highly significant, reinforcing the importance of project timing.
    • V3 (Lot area): β = -0.2331, p < 0.001 — significant negative effect.
    • V8 (Initial unit price): β = 1.203, p < 0.001 — strong positive influence on sales price.
    • V1 (Project locality): β = -4.779, p < 0.05 — moderate negative effect.
    • V2 (Total floor area): β = 0.0663, p < 0.05 — moderate positive effect.
    • V5 (Preliminary estimated cost): β = -0.6548, p < 0.1 — marginally significant negative effect.
  • These estimates highlight the central role of physical and economic attributes in shaping final sale price.

  • Model Fit and Complexity:
    • Adjusted R² = 0.9848 indicates the model explains 98.48% of the variance in actual sales price.
    • F-statistic = 321.9 (p < 2.2e-16), confirming overall model significance.
  • While the model performs well in-sample, the inclusion of 75 predictors raises concerns about overfitting and generalisability.
  • Multicollinearity and Stability:
    • 32 coefficients are undefined — a strong signal of multicollinearity among predictors.
    • VIF analysis is recommended to identify redundant predictors and stabilise model interpretation.
  • This suggests the need for regularisation or feature reduction to improve robustness.
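As an illustration of the recommended VIF check, a minimal sketch using the car package (an assumption; it must be installed separately). Note that vif() cannot be applied to the full model above because of its aliased (NA) coefficients, so a reduced model of the significant predictors is used here:

```r
library(car)  # assumed installed; provides vif()

# VIF on a reduced model containing the significant physical/price predictors;
# values above ~10 are commonly taken to indicate problematic collinearity
vif_model <- lm(V104 ~ V1 + V2 + V3 + V5 + V8, data = Data)
vif(vif_model)
```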

c). Fit a linear regression model to explain the “actual sales price” (V104) in terms of other variables (excluding the variable “actual construction costs” (V105)) using (a) backwards selection and (b) stepwise selection. Compare these two models in terms of outputs, computational time and holdout mean square error. (Hint: summarising your results in a table allows for easy comparison).

# Split the data 
set.seed(123)  # Set seed for reproducibility
train_index <- sample(1:nrow(Data), 0.8 * nrow(Data))
train_data <- Data[train_index, ]
test_data <- Data[-train_index, ]

# Define full model 
full_model <- lm(V104 ~ . -V105, data = train_data)

# (a) Backwards Selection
start_time_backward <- Sys.time()
backward_model <- stepAIC(full_model, direction = "backward", trace = 0)
end_time_backward <- Sys.time()
backward_time <- end_time_backward - start_time_backward

backward_pred <- predict(backward_model, newdata = test_data)
backward_mse <- mean((backward_pred - test_data$V104)^2)


# (b) Stepwise Selection
start_time_stepwise <- Sys.time()
stepwise_model <- stepAIC(full_model, direction = "both", trace = 0)
end_time_stepwise <- Sys.time()
stepwise_time <- end_time_stepwise - start_time_stepwise

stepwise_pred <- predict(stepwise_model, newdata = test_data)
stepwise_mse <- mean((stepwise_pred - test_data$V104)^2)


# Compare models
comparison_results <- data.frame(
  Model = c("Backward", "Stepwise"),
  Variables = c(length(coef(backward_model)), length(coef(stepwise_model))),
  R_Squared = c(summary(backward_model)$r.squared, summary(stepwise_model)$r.squared),
  Adj_R2 = c(summary(backward_model)$adj.r.squared, summary(stepwise_model)$adj.r.squared),
  Time = c(backward_time, stepwise_time),
  AIC = c(AIC(backward_model), AIC(stepwise_model)),
  BIC = c(BIC(backward_model), BIC(stepwise_model)),
  Holdout_MSE = c(backward_mse, stepwise_mse) 
  )
  
print(comparison_results)
##      Model Variables R_Squared    Adj_R2          Time      AIC      BIC
## 1 Backward        44 0.9886593 0.9867319 11.61730 secs 3793.543 3959.760
## 2 Stepwise        23 0.9886212 0.9877076 18.18133 secs 3752.539 3841.189
##   Holdout_MSE
## 1    89101.99
## 2    69494.02
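Beyond the summary table, it can be informative to see which predictors each procedure actually kept; a small sketch using the fitted objects above:

```r
# Predictors retained by each selection procedure (intercept dropped)
stepwise_vars <- names(coef(stepwise_model))[-1]
backward_vars <- names(coef(backward_model))[-1]
stepwise_vars                            # the predictors stepwise kept
setdiff(backward_vars, stepwise_vars)    # extras kept only by backward
```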

d). Fit a linear regression model to explain the “actual sales price” (V104) in terms of other variables (excluding the variable “actual construction costs” (V105)) using (i) Ridge regression (ii) LASSO regression with an appropriate 𝜆 determined by cross validation. Compare these two models in terms of outputs, computational time and cross validation mean square error.

# Use the same training/testing split as previously

# Set cross-validation parameters
set.seed(123)

# Create model matrix (excluding V105)
X_train <- model.matrix(V104 ~ . - V105, data = train_data)
y_train <- train_data$V104
X_test <- model.matrix(V104 ~ . - V105, data = test_data)
y_test <- test_data$V104

### ---- (i) Ridge Regression (alpha = 0) ---- ###
start_ridge <- Sys.time()
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
end_ridge <- Sys.time()
time_ridge <- end_ridge - start_ridge

ridge_best_lambda <- cv_ridge$lambda.min
best_ridge_mod <- glmnet(X_train, y_train, alpha = 0, lambda = ridge_best_lambda)

ridge_pred <- predict(best_ridge_mod, newx = X_test)
ridge_mse <- mean((ridge_pred - y_test)^2)

### ---- (ii) LASSO Regression (alpha = 1) ---- ###
start_lasso <- Sys.time()
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)
end_lasso <- Sys.time()
time_lasso <- end_lasso - start_lasso

lasso_best_lambda <- cv_lasso$lambda.min
best_lasso_mod <- glmnet(X_train, y_train, alpha = 1, lambda = lasso_best_lambda)

lasso_pred <- predict(best_lasso_mod, newx = X_test)
lasso_mse <- mean((lasso_pred - y_test)^2)

### ---- Cross-validation plot  ---- ###
par(mfrow = c(1, 2))

plot(cv_ridge, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "Ridge: Cross-Validation", line = 2.5) 

plot(cv_lasso, main = "", xlab = "log(Lambda)", ylab = "CV Error")
title(main = "LASSO: Cross-Validation", line = 2.5) 

par(mfrow = c(1, 1))

### --- Model Comparison Table --- ###
model_comparison <- data.frame(
  Model = c("Ridge", "LASSO","Backward", "Stepwise"),
  Time = c(time_ridge, time_lasso, backward_time, stepwise_time),
  `CV MSE` = c(min(cv_ridge$cvm), min(cv_lasso$cvm), NA, NA), 
  `Holdout MSE` = c(ridge_mse, lasso_mse, backward_mse, stepwise_mse),
  Predictors = c(length(coef(best_ridge_mod))-1, length(which(coef(best_lasso_mod)!=0))-1, 
                 length(coef(backward_model))-1, length(coef(stepwise_model))-1)
)

print(model_comparison)
##      Model            Time   CV.MSE Holdout.MSE Predictors
## 1    Ridge  0.2722230 secs 48080.05    92275.39        108
## 2    LASSO  0.1895859 secs 27436.48    61648.42         34
## 3 Backward 11.6173031 secs       NA    89101.99         43
## 4 Stepwise 18.1813331 secs       NA    69494.02         22

e). Please give a possible explanation of why one method is better than the other in this case, considering also results from backwards and stepwise selection.

From the comparison of the four models (Ridge, LASSO, Backward, and Stepwise selection), LASSO demonstrates the best overall performance in this case, and here’s why:
  • Prediction Accuracy:
    • LASSO achieved the lowest Holdout MSE (61,648.42), outperforming Ridge (92,275.39), Backward (89,101.99), and Stepwise (69,494.02).
    • Its CV MSE (27,436.48) was also lower than Ridge’s (48,080.05), reinforcing its robustness across folds.
    These results indicate that LASSO generalises better to unseen data than the other methods.
  • Computational Efficiency:
    • LASSO completed in roughly 0.19 seconds, far faster than Backward (11.62 s) and Stepwise (18.18 s).
    • Ridge was also fast (0.27 s) but underperformed in prediction accuracy.
    This makes LASSO highly scalable for larger datasets without sacrificing performance.
  • Model Sparsity and Interpretability:
    • LASSO retained 34 predictors, offering a simpler, more interpretable model.
    • By contrast, Ridge retained all 108 variables, and Backward kept 43. Stepwise retained the fewest (22), but with higher error.
  • LASSO strikes a balance by selecting key variables while maintaining predictive strength.
  • Why LASSO Outperforms Traditional Methods:
    • LASSO’s L1 penalty automatically shrinks irrelevant coefficients to zero, unlike stepwise selection which may get stuck in suboptimal variable sets.
    • Both LASSO and Ridge handle multicollinearity better than stepwise methods, which are more sensitive to correlated predictors.
  • While Stepwise is easier to interpret, its higher error and slower execution make it less practical. Ridge avoids overfitting but retains unnecessary complexity. LASSO offers the most effective trade-off between performance, simplicity, and generalisability.
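A common robustness check, sketched below, is to refit LASSO at cv_lasso$lambda.1se (the largest 𝜆 within one standard error of the minimum CV error), which typically yields an even sparser model at a modest cost in accuracy:

```r
# Sparser 'one-standard-error' LASSO fit for comparison
lasso_1se <- glmnet(X_train, y_train, alpha = 1, lambda = cv_lasso$lambda.1se)
n_pred_1se <- sum(coef(lasso_1se) != 0) - 1                    # retained predictors
mse_1se <- mean((predict(lasso_1se, newx = X_test) - y_test)^2)
c(predictors = n_pred_1se, holdout_mse = mse_1se)
```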

Question 2

Step 1: Data Preparation

# Load the Parkinson's dataset
parkinsons_data <- read.csv(path(data_folder,"parkinsons.csv"), header = TRUE)[-1]

# Function to split and scale data
prepare_data <- function(data, seed) {
  set.seed(seed)
  n <- nrow(data)
  train_indices <- sample(1:n, 30)
  
  train_data <- data[train_indices, ]
  test_data <- data[-train_indices, ]
  
  X_train <- as.matrix(train_data[, paste0("X", 1:97)])
  y_train <- train_data$UPDRS
  X_test <- as.matrix(test_data[, paste0("X", 1:97)])
  y_test <- test_data$UPDRS
  
  X_train_scaled <- scale(X_train)
  X_test_scaled <- scale(X_test, 
                         center = attr(X_train_scaled, "scaled:center"), 
                         scale = attr(X_train_scaled, "scaled:scale"))
  
  return(list(
    X_train = X_train_scaled,
    y_train = y_train,
    X_test = X_test_scaled,
    y_test = y_test
  ))
}

a). Confirm that a linear model can fit the training data exactly. Why is this model not going to be useful?

lm_data <- prepare_data(parkinsons_data, 123)

# Fit a linear model
lm_model <- lm(lm_data$y_train ~ lm_data$X_train)
#summary(lm_model)
html_summary(lm_model, title = "trial") 

trial


Call:
lm(formula = lm_data$y_train ~ lm_data$X_train)

Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!

Coefficients: (68 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)            26.14        NaN     NaN      NaN
lm_data$X_trainX1    -555.71        NaN     NaN      NaN
lm_data$X_trainX2     -34.59        NaN     NaN      NaN
lm_data$X_trainX3     585.96        NaN     NaN      NaN
lm_data$X_trainX4    -153.57        NaN     NaN      NaN
lm_data$X_trainX5    -599.82        NaN     NaN      NaN
lm_data$X_trainX6     -24.87        NaN     NaN      NaN
lm_data$X_trainX7      51.83        NaN     NaN      NaN
lm_data$X_trainX8     170.76        NaN     NaN      NaN
lm_data$X_trainX9    -225.87        NaN     NaN      NaN
lm_data$X_trainX10     28.81        NaN     NaN      NaN
lm_data$X_trainX11    173.40        NaN     NaN      NaN
lm_data$X_trainX12    -86.58        NaN     NaN      NaN
lm_data$X_trainX13  31994.20        NaN     NaN      NaN
lm_data$X_trainX14   6745.51        NaN     NaN      NaN
lm_data$X_trainX15   1379.28        NaN     NaN      NaN
lm_data$X_trainX16 -32628.15        NaN     NaN      NaN
lm_data$X_trainX17  -3149.37        NaN     NaN      NaN
lm_data$X_trainX18    188.29        NaN     NaN      NaN
lm_data$X_trainX19    713.14        NaN     NaN      NaN
lm_data$X_trainX20   -239.04        NaN     NaN      NaN
lm_data$X_trainX21   -283.17        NaN     NaN      NaN
lm_data$X_trainX22    278.35        NaN     NaN      NaN
lm_data$X_trainX23    489.80        NaN     NaN      NaN
lm_data$X_trainX24   -125.69        NaN     NaN      NaN
lm_data$X_trainX25 -32067.51        NaN     NaN      NaN
lm_data$X_trainX26  -6759.54        NaN     NaN      NaN
lm_data$X_trainX27  -1427.30        NaN     NaN      NaN
lm_data$X_trainX28  32511.82        NaN     NaN      NaN
lm_data$X_trainX29   3023.98        NaN     NaN      NaN
lm_data$X_trainX30        NA         NA      NA       NA
  [rows for X31 through X97 omitted: all 68 remaining coefficients are NA, dropped by lm() as aliased (singular) columns]

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 29 and 0 DF,  p-value: NA

The R-squared value of 1 indicates a perfect fit on the training data. However, this is an artefact of overfitting: with 97 predictors and only 30 observations, the least-squares fit interpolates the training data exactly (note the 0 residual degrees of freedom and the 68 aliased coefficients). The model memorises the data instead of learning meaningful relationships, which results in poor generalisation to new data.
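This saturation effect can be reproduced on synthetic data. A minimal sketch (pure-noise response, more predictors than observations, unrelated to the Residen data):

```r
# Minimal sketch: once the number of predictors (plus intercept) reaches the
# number of observations, OLS interpolates the training data exactly, so the
# in-sample R-squared is 1 even though y is pure noise.
set.seed(1)
n <- 10; p <- 12                 # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                    # no true relationship with X
fit <- lm(y ~ X)                 # rank-deficient: some coefficients come back NA
summary(fit)$r.squared           # 1: a "perfect" fit, by construction
```

The in-sample R-squared is 1 and several coefficients are NA, mirroring the singular fit shown above.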

b). Now use the LASSO to fit the training data, using leave-one-out cross-validation to find the tuning parameter lambda. (This means nfolds = 30. You will also find it convenient to set grid = 10^seq(3, -1, length.out = 100) and thresh = 1e-10.) What is the optimal value of lambda, and what is the resulting test error?

c). State your final model for the UPDRS. How many features have been selected? What conclusions can you draw?

# General LASSO function to fit and return key results
lasso_regression <- function(X_train, y_train, X_test, y_test) {
  # Define lambda grid for tuning
  grid <- 10^seq(3, -1, length.out = 100)

  # Perform Leave-One-Out Cross-Validation
  lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = grid,
                        nfolds = nrow(X_train), thresh = 1e-10)

  # Extract the best lambda value
  optimal_lambda <- lasso_cv$lambda.min

  # Fit final LASSO model using the best lambda
  lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = optimal_lambda)

  # Predict on the test set and compute mean squared error
  y_pred <- predict(lasso_model, newx = X_test)
  test_error <- mean((y_test - y_pred)^2)

  # Get the names of selected (non-zero) features
  coef_full <- coef(lasso_model)
  selected_features <- rownames(coef_full)[coef_full[, 1] != 0][-1]

  # Build the final model formula as a string
  formula_terms <- sprintf("%.4f * %s", coef_full[selected_features, 1], selected_features)
  formula_str <- paste("UPDRS =", round(coef_full[1,1], 4), "+", paste(formula_terms, collapse = " + "))

  # Return results
  return(list(
    lambda = optimal_lambda,
    test_error = test_error,
    selected_features = selected_features,
    n_nonzero = length(selected_features),
    formula = formula_str,
    cv_plot = lasso_cv
  ))
}


# Prepare data and fit the model
data123 <- prepare_data(parkinsons_data, 123)
result123 <- lasso_regression(data123$X_train, data123$y_train, data123$X_test, data123$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation errors vs. log(lambda)
plot(result123$cv_plot)

# Print key results
cat("Best lambda:", result123$lambda, "\n")
## Best lambda: 1.023531
cat("Test set MSE:", result123$test_error, "\n")
## Test set MSE: 20.41405
cat("Number of selected features:", result123$n_nonzero, "\n")
## Number of selected features: 4
cat("Selected features:", paste(result123$selected_features, collapse = ", "), "\n")
## Selected features: X9, X82, X83, X97
cat("LASSO model equation:\n", result123$formula, "\n")
## LASSO model equation:
##  UPDRS = 26.1423 + 0.2039 * X9 + 0.0586 * X82 + 0.3937 * X83 + 7.0572 * X97

The LASSO regression identified a lambda value (≈ 1.02) that minimised the cross-validated MSE, yielding a model with only four non-zero coefficients: X9, X82, X83, and X97. This demonstrates LASSO's feature-selection capability, which improves interpretability and reduces the risk of overfitting. The resulting model is: \[ \text{UPDRS} = 26.1423 + 0.2039 \times X_9 + 0.0586 \times X_{82} + 0.3937 \times X_{83} + 7.0572 \times X_{97} \] With only four key predictors retained, the coefficients show directly how each selected feature contributes to the UPDRS score. This sparse model also performs reasonably on the test set (MSE ≈ 20.4), striking a good balance between accuracy and simplicity.
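For reference, the sparsity seen here follows from the L1-penalised objective that the LASSO minimises (written in the glmnet parameterisation):

\[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0, \beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]

The L1 penalty shrinks coefficients exactly to zero once lambda is large enough, which is why only four of the 97 predictors survive at lambda ≈ 1.02.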

d). Repeat your analysis with a different random split of the training and test sets. Have the same features been selected in your final model?

# Prepare data with a different seed and fit the model
data321 <- prepare_data(parkinsons_data, 321)
result321 <- lasso_regression(data321$X_train, data321$y_train, data321$X_test, data321$y_test)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Plot cross-validation result
plot(result321$cv_plot)

# Print key results
cat("Best lambda:", result321$lambda, "\n")
## Best lambda: 0.9326033
cat("Test set MSE:", result321$test_error, "\n")
## Test set MSE: 4.899441
cat("Number of selected features:", result321$n_nonzero, "\n")
## Number of selected features: 3
cat("Selected features:", paste(result321$selected_features, collapse = ", "), "\n")
## Selected features: X83, X95, X97
cat("LASSO model equation:\n", result321$formula, "\n")
## LASSO model equation:
##  UPDRS = 25.4929 + 0.2028 * X83 + 0.2533 * X95 + 9.0117 * X97
# Build a comparison table of the two models
comparison_table <- data.frame(
  Seed = c(123, 321),
  Optimal_Lambda = c(result123$lambda, result321$lambda),
  Test_MSE = c(result123$test_error, result321$test_error),
  Num_Selected_Features = c(result123$n_nonzero, result321$n_nonzero),
  Selected_Features = c(
    paste(result123$selected_features, collapse = ", "),
    paste(result321$selected_features, collapse = ", ")
  )
)

# Display the table 
print(comparison_table)
##   Seed Optimal_Lambda  Test_MSE Num_Selected_Features Selected_Features
## 1  123      1.0235310 20.414054                     4 X9, X82, X83, X97
## 2  321      0.9326033  4.899441                     3     X83, X95, X97

Repeating the analysis with a different random seed (321 instead of 123) changed the training-test split. The test-set mean squared error (MSE) fell from 20.414 under the original split to 4.899 under the new one, indicating that the model trained on the new split generalises better to its test set.

The final model selected a slightly different set of features. Under seed = 123 the selected features were X9, X82, X83, and X97; under seed = 321 they were X83, X95, and X97.

While some features like X83 and X97 appeared in both models, others (e.g., X9, X82 and X95) were different. This indicates that the feature selection process is sensitive to the specific training-test split and that small changes in data partitioning can lead to different optimal subsets.

The final model under seed = 321 is: \[ \text{UPDRS} = 25.4929 + 0.2028 \times X_{83} + 0.2533 \times X_{95} + 9.0117 \times X_{97} \]

  • Effect of Random Seed on Model Performance:
    • Under seed = 123, the test set MSE was 20.414.
    • Under seed = 321, the test set MSE dropped to 4.899.
    This reduction indicates that the model trained with the seed-321 partition performed significantly better on unseen data, highlighting the sensitivity of LASSO performance to training-test splits.
  • Selected Features:
    • Seed = 123: X9, X82, X83, X97
    • Seed = 321: X83, X95, X97
    While X83 and X97 were selected in both models, the presence of different variables (e.g., X9 and X95) reflects variability in feature selection due to data partitioning.
  • Interpretation:
    • The overlap suggests that certain predictors (e.g., X83 and X97) are robustly informative.
    • The differences show that LASSO models can be sensitive to small variations in data splits, a reminder to validate stability across multiple resampling configurations.
    These observations reinforce the need for careful validation when interpreting the predictors selected by regularised models.


Question 3

a). Train an ElasticNet model to predict MEAN_ANNUAL_RAINFALL using 10-fold cross-validation to optimise values for 𝛼 and 𝜆.

  1. Data Preparation and Splitting
# Load the data
weather_data <- read.csv(path(data_folder,"Weather_Station_Data_v1.csv"))

# Set seed for reproducibility
set.seed(321)

# Split the data into training and test sets
train_indices <- sample(1:nrow(weather_data), size = 0.8 * nrow(weather_data))
train_data <- weather_data[train_indices, ]
test_data <- weather_data[-train_indices, ]

# Define predictors and response
X_train <- as.matrix(train_data[, -which(names(train_data) == "MEAN_ANNUAL_RAINFALL")])
y_train <- train_data$MEAN_ANNUAL_RAINFALL
X_test <- as.matrix(test_data[, -which(names(test_data) == "MEAN_ANNUAL_RAINFALL")])
y_test <- test_data$MEAN_ANNUAL_RAINFALL
  2. ElasticNet Model with Cross-Validation
# Define a sequence of alpha values
alphas <- seq(0, 1, by = 0.1)

# Perform cross-validation for each alpha
cv_results <- lapply(alphas, function(a) {
  cv.glmnet(X_train, y_train, alpha = a, nfolds = 10)
})
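Note that each cv.glmnet call above draws its own random folds, so the comparison across alpha values also reflects fold randomness. A stricter search fixes the folds once via cv.glmnet's foldid argument. A self-contained sketch on synthetic data (variable names here are illustrative, not from the weather dataset):

```r
library(glmnet)

# Fix the CV folds once and reuse them for every alpha, so differences in
# cross-validated error reflect alpha alone, not fold-assignment noise.
set.seed(321)
Xs <- matrix(rnorm(100 * 5), 100, 5)
ys <- as.numeric(Xs %*% c(2, -1, 0, 0, 1) + rnorm(100))
foldid <- sample(rep(1:10, length.out = nrow(Xs)))
alphas_s <- seq(0, 1, by = 0.1)
cvs <- lapply(alphas_s, function(a)
  cv.glmnet(Xs, ys, alpha = a, foldid = foldid))
best_alpha_s <- alphas_s[which.min(sapply(cvs, function(cv) min(cv$cvm)))]
```

With shared folds, the alpha yielding the lowest cross-validated error can be compared fairly across the grid.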

# Find the best model based on minimum cross-validated error
best_alpha_idx <- which.min(sapply(cv_results, function(cv) min(cv$cvm)))
best_cv_mod <- cv_results[[best_alpha_idx]]
best_alpha <- alphas[best_alpha_idx]

cat("Best alpha selected is:", best_alpha, "\n")
## Best alpha selected is: 0.3

Based on the cross-validation, the best alpha for the elastic net on this dataset is 0.3; this value is used in the following steps.
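For reference, the elastic net penalty mixes the ridge and LASSO penalties through alpha (glmnet parameterisation):

\[ P_{\alpha,\lambda}(\beta) = \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right] \]

With alpha = 0.3 the penalty is weighted towards the ridge term, which stabilises correlated predictors, while the L1 component still allows some coefficients to be set exactly to zero.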

b). Plot the cross-validation results for your model

lambda_min <- best_cv_mod$lambda.min
lambda_1se <- best_cv_mod$lambda.1se

# Plot the cross-validation results
plot(best_cv_mod, main = "", xlab = "Log(lambda)", ylab = "Mean Squared Error")
title(main = "ElasticNet Cross-Validation Results", line = 2)
abline(v = log(lambda_min), col = "green", lty = 2)
abline(v = log(lambda_1se), col = "red", lty = 2)
legend("topright", legend = c("lambda.min", "lambda.1se"),
       col = c("green", "red"), lty = 2, bty = "n")

cat("The value of Lambda.min is: ", lambda_min, "\n")
## The value of Lambda.min is:  0.3456271
cat("The value of Lambda.1se is: ", lambda_1se, "\n")
## The value of Lambda.1se is:  20.7198

The plot shows the cross-validated mean squared error (MSE) across the lambda grid. The green line marks lambda.min (the lambda with the lowest CV error), and the red line marks lambda.1se (the largest lambda whose CV error is within one standard error of the minimum, giving a simpler model).
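Formally, lambda.1se is the largest (most heavily penalised) lambda whose cross-validated error is within one standard error of the best:

\[ \lambda_{1se} = \max \left\{ \lambda : \text{CV}(\lambda) \le \text{CV}(\lambda_{\min}) + \text{SE}(\lambda_{\min}) \right\} \]

Because larger lambda means stronger shrinkage, lambda.1se trades a statistically negligible increase in CV error for a sparser model.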

c). Make predictions on your test set using both lambda.min and lambda.1se, show the coefficients of each model, report the MSE and/or RMSE of both, and the number of predictors used in each model.

# Predictions
pred_min <- predict(best_cv_mod, s = "lambda.min", newx = X_test)
pred_1se <- predict(best_cv_mod, s = "lambda.1se", newx = X_test)

# Calculate MSE and RMSE
mse_min <- mean((y_test - pred_min)^2)
rmse_min <- sqrt(mse_min)

mse_1se <- mean((y_test - pred_1se)^2)
rmse_1se <- sqrt(mse_1se)

# Count non-zero coefficients
coef_min <- coef(best_cv_mod, s = "lambda.min")
coef_1se <- coef(best_cv_mod, s = "lambda.1se")
n_pred_min <- sum(coef_min != 0) - 1  # exclude intercept
n_pred_1se <- sum(coef_1se != 0) - 1

# Create summary table
comparison_table <- data.frame(
  Model = c("lambda.min", "lambda.1se"),
  Alpha = c(best_alpha, best_alpha),
  Lambda = c(lambda_min, lambda_1se),
  MSE = c(mse_min, mse_1se),
  RMSE = c(rmse_min, rmse_1se),
  Predictors_Used = c(n_pred_min, n_pred_1se)
)

print(comparison_table)
##        Model Alpha     Lambda      MSE     RMSE Predictors_Used
## 1 lambda.min   0.3  0.3456271 11412.89 106.8311              13
## 2 lambda.1se   0.3 20.7198017 13078.75 114.3624               9
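The error metrics in the table can be sanity-checked on toy numbers. A minimal sketch, independent of the fitted models:

```r
# MSE is the mean of squared residuals; RMSE is its square root, expressed in
# the units of the response (mm of rainfall here).
y_true <- c(100, 200, 300)
y_hat  <- c(110, 190, 330)
mse  <- mean((y_true - y_hat)^2)   # (100 + 100 + 900) / 3
rmse <- sqrt(mse)
c(MSE = mse, RMSE = rmse)
```

RMSE is usually the easier number to interpret, since it is on the same scale as MEAN_ANNUAL_RAINFALL itself.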
# Extract and display coefficients
coef_table_min <- data.frame(
  Variable = rownames(as.matrix(coef_min)),
  Coefficient = round(as.vector(coef_min), 4)
)

coef_table_1se <- data.frame(
  Variable = rownames(as.matrix(coef_1se)),
  Coefficient = round(as.vector(coef_1se), 4)
)

cat("Coefficients for lambda.min:\n")
## Coefficients for lambda.min:
print(coef_table_min)
##                  Variable Coefficient
## 1             (Intercept)   1043.8279
## 2                ALTITUDE      0.1367
## 3    MEAN_ANNUAL_AIR_TEMP    -44.2016
## 4   MEAN_MONTHLY_MAX_TEMP     22.8969
## 5   MEAN_MONTHLY_MIN_TEMP     -5.3594
## 6  MEAN_ANNUAL_WIND_SPEED    -14.1155
## 7        MEAN_CLOUD_COVER      3.7076
## 8    MEAN_ANNUAL_SUNSHINE     -0.0120
## 9  MAX_MONTHLY_WIND_SPEED     -8.1163
## 10           MAX_AIR_TEMP    -29.2842
## 11         MAX_WIND_SPEED     -0.2383
## 12           MAX_RAINFALL     20.6347
## 13           MIN_AIR_TEMP     27.2047
## 14    MEAN_RANGE_AIR_TEMP     22.3804
cat("Coefficients for lambda.1se:\n")
## Coefficients for lambda.1se:
print(coef_table_1se)
##                  Variable Coefficient
## 1             (Intercept)    612.3961
## 2                ALTITUDE      0.1671
## 3    MEAN_ANNUAL_AIR_TEMP     -4.9645
## 4   MEAN_MONTHLY_MAX_TEMP      0.0000
## 5   MEAN_MONTHLY_MIN_TEMP     -1.0373
## 6  MEAN_ANNUAL_WIND_SPEED     -8.0770
## 7        MEAN_CLOUD_COVER      1.6533
## 8    MEAN_ANNUAL_SUNSHINE     -0.0031
## 9  MAX_MONTHLY_WIND_SPEED      0.0000
## 10           MAX_AIR_TEMP    -16.0839
## 11         MAX_WIND_SPEED      0.0000
## 12           MAX_RAINFALL     19.1592
## 13           MIN_AIR_TEMP      9.3296
## 14    MEAN_RANGE_AIR_TEMP      0.0000

d).Which model would you choose? Justify your answer.

We evaluated two elastic net regression models with alpha = 0.3, one using lambda.min and one using lambda.1se. The lambda.min model had the lower test MSE of 11,412.89 (RMSE = 106.83), while the lambda.1se model had a slightly higher MSE of 13,078.75 (RMSE = 114.36).

In terms of complexity, the lambda.min model used 13 predictors, whereas the lambda.1se model used only 9 predictors, indicating stronger regularisation.

This shows a typical trade-off: lambda.min provides better predictive accuracy, but with more variables in the model. lambda.1se gives a more parsimonious model (fewer predictors), which may be preferred for simplicity or interpretability, even at the cost of slightly higher error.

In real-world applications, especially when working with environmental or climate-related data, interpretability and generalisability are often more important than marginal gains in accuracy. A simpler model is also easier to communicate and deploy.

Therefore, the lambda.1se model strikes a better balance between performance and model simplicity.

The formula for the final model is as follows: \[ \begin{aligned} \hat{Y} =\ & 612.3961 \\ &+ 0.1671 \cdot \text{ALTITUDE} \\ &- 4.9645 \cdot \text{MEAN\_ANNUAL\_AIR\_TEMP} \\ &- 1.0373 \cdot \text{MEAN\_MONTHLY\_MIN\_TEMP} \\ &- 8.0770 \cdot \text{MEAN\_ANNUAL\_WIND\_SPEED} \\ &+ 1.6533 \cdot \text{MEAN\_CLOUD\_COVER} \\ &- 0.0031 \cdot \text{MEAN\_ANNUAL\_SUNSHINE} \\ &- 16.0839 \cdot \text{MAX\_AIR\_TEMP} \\ &+ 19.1592 \cdot \text{MAX\_RAINFALL} \\ &+ 9.3296 \cdot \text{MIN\_AIR\_TEMP} \end{aligned} \]

  • Model Performance Comparison:
    • The lambda.min model achieves lower error rates (MSE = 11,412.89; RMSE = 106.83) compared to the lambda.1se model (MSE = 13,078.75; RMSE = 114.36).
    This suggests that lambda.min provides stronger predictive performance on the given data.
  • Model Simplicity and Parsimony:
    • The lambda.1se model uses only 9 predictors, while the lambda.min model retains 13.
    This reduction in model complexity improves interpretability and may support better generalisation to new data.
  • Trade-off and Decision Justification:
    • While lambda.min offers slightly lower error, the increase in complexity is not justified by a large performance gain.
    • In environmental and scientific applications, simpler models are often preferred when predictive power is comparable.
    For this reason, the lambda.1se model is selected as the final model for its balance between accuracy and parsimony.
